Introduction

This report is the first to document and study the feasibility of automatically evaluating the quality of experimental literature investigating bio–nano interactions. The first step of this automatic evaluation is to isolate the Materials and Methods section. The goal is to later use this section alone to assess whether the nanomaterials have been characterised and to evaluate the quality of the articles.

This report contains preliminary analyses and an exploration of the data contained in the text corpus. The first goal of these analyses is to gain some understanding of the structure of the texts in the corpus of articles and of the relations of the lemmas “material(s)” and “method(s)” to this corpus.

The second goal is to investigate how to detect the beginning of the “Materials and Methods” section. The main difficulty in identifying the start of this section is that these two words can also appear elsewhere in the text of the article (typically in cross-references such as “cf. Materials and Methods”).

The text corpus was created from the folder “Full Text dev set”, which contains 751 articles converted to txt format. The other articles are kept unseen in order to test, under “real-life conditions”, the efficacy of any tools developed later.

A few observations to frame the problem:

A quick exploratory data analysis of the article Abrams, MT et al., 2010 suggested that the “materials” token of the Materials and Methods section title has a specific property: its head_token_id is equal to zero, i.e. the token is the root of its sentence and has no head other than itself (cf. the example below). This suggests that the section titles of articles may share this property. This hypothesis is tested in the first part of this report and, for the lemmas “materials” and “material”, in a later section (Co-occurrences for materials and material when their head_token_id = 0).

In the later sections, we try different criteria to isolate some of the occurrences of the lemmas “materials”, “material”, “methods” and “method”. We use a technique, co-occurrence analysis, to explore the surroundings of these lemmas in the text and to evaluate whether each criterion discriminates the beginning of the Materials and Methods section from the rest of the article.

R Markdown

This is an R Markdown document. Markdown is a simple formatting syntax for authoring HTML, PDF, and MS Word documents. It is a good way to create informal reports describing data analysis projects as a web page, and a good way to mix code and description in a readable manner. There are even books in this format, ranging from Data Analysis for the Life Sciences to Text Mining with R: A Tidy Approach, so anybody can understand and reuse this work. This report is also code: it can be recompiled with new data (including another model for the annotation of the corpus).

Import and data structure

library(udpipe)
library(lattice)
library(wordcloud)
library(igraph)
library(ggraph)
library(ggplot2)
library(dplyr)

The following lines load the text corpus, already tokenized and annotated:

x <- readRDS(file = "annotation_partut.rds")
x <- as.data.frame(x)
length(unique(x$doc_id))
## [1] 751

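Here is an example row of the annotation table, showing the columns available for each token. In this compiled output, row 7467 is the token “using” with a head_token_id of 4, not the “materials” token with head_token_id = 0 that was intended, so the row index presumably no longer points at the original example: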

x[7467,]
##      doc_id paragraph_id sentence_id
## 7467   doc1          602         824
##                                                                                                                                               sentence
## 7467 The ethanol was removed using tangential flow filtration, followed by buffer exchange so that the particle is in 100% ­phosphate-buffered saline.
##      token_id token lemma upos xpos        feats head_token_id dep_rel
## 7467        5 using   use VERB    V VerbForm=Ger             4   advcl
##      deps misc
## 7467 <NA> <NA>
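
A row index such as 7467 depends on the exact annotation run. A way to retrieve such an example without a fixed index is to filter on the lemma and on head_token_id. The sketch below does this on a small invented table: the toy data frame and all its values are assumptions made for illustration, not rows of the corpus; only the column names follow the udpipe annotation shown above.

```r
# Toy annotation table (invented values, same column names as the udpipe output)
toy <- data.frame(
  doc_id        = rep("docA", 6),
  sentence_id   = c(1, 1, 1, 2, 2, 2),
  token_id      = c("1", "2", "3", "1", "2", "3"),
  token         = c("Materials", "and", "Methods", "See", "Materials", "section"),
  lemma         = c("materials", "and", "methods", "see", "materials", "section"),
  head_token_id = c("0", "3", "1", "0", "3", "1"),
  stringsAsFactors = FALSE
)

# Keep the rows where the lemma "materials" is the root of its sentence
root_materials <- subset(toy, lemma == "materials" & head_token_id == 0)
print(root_materials[, c("doc_id", "sentence_id", "token", "head_token_id")])
```

On the real annotation, the same subset() call applied to x would list every candidate row, from which one example can be picked.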

Words with head_token_id == 0

Given the observation that, in “Materials and Methods”, the head_token_id of the token “Materials” was 0, one idea was to explore which lemmas most commonly have a head_token_id equal to zero across the corpus.

The expected outcome of this analysis is to find the usual section titles of scientific articles, such as Abstract or Results, among the most common of these lemmas. The goal is to assess whether this is a consistent property of section titles and to uncover potential synonyms of “Materials and Methods”, such as “Experimental section”.

stats <- subset(x, head_token_id == 0) #https://bnosac.github.io/udpipe/docs/doc7.html
stats <- txt_freq(x = stats$lemma)

stats$key <- factor(stats$key, levels = rev(stats$key))
barchart(key ~ freq, data = head(stats, 30), col = "cadetblue", main = "Most occurring words with Head_token_id = 0", xlab = "Freq")

Nonetheless, this assumption turns out to be quite naive, as many tokens have this property. Let’s filter for specific lemmas that correspond to usual section titles, such as abstract or results:

stats<-stats %>% filter(key %in% c("material", "materials", "result", "results", "abstract", "introduction" , "method", "methods", "discussion", "references"))

stats$key <- factor(stats$key, levels = rev(stats$key))
barchart(key ~ freq, data = head(stats, 30), col = "cadetblue", main = "Count of lemma for usual sections name with Head_token_id = 0", xlab = "Freq")

stats
##             key freq     freq_pct
## 1        result 1781 0.3014531021
## 2        method  674 0.1140816344
## 3      material  406 0.0687197976
## 4    discussion  224 0.0379143711
## 5  introduction  169 0.0286050389
## 6     materials   50 0.0084630293
## 7       methods   48 0.0081245081
## 8      abstract   22 0.0037237329
## 9       results    9 0.0015233453
## 10   references    2 0.0003385212

Some section titles seem to have the aforementioned property. Nonetheless, the counts do not match the total number of articles in this corpus (751). Taking the token “discussion” as an example: either some articles do not have a Discussion section or, more probably, the token “discussion” does not always have the property mentioned earlier. We can check this:

occurrences<-which(x$lemma=="discussion")
length(occurrences)
## [1] 899
length(unique(x[occurrences,]$doc_id))
## [1] 707

There are 899 occurrences of the lemma “discussion” in the whole corpus, spread over 707 articles, whereas only 224 occurrences have a head_token_id of zero. It therefore seems very likely that a head_token_id of zero alone is not sufficient to identify the tokens that are section titles.
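
The comparison made here for “discussion” can be repeated for any lemma. The sketch below, on an invented toy table, counts the occurrences of a lemma, the documents containing it, and how many of those are root occurrences (head_token_id = 0). The data and the helper name lemma_coverage are assumptions for illustration only.

```r
# Toy annotation table (invented values)
toy <- data.frame(
  doc_id        = c("docA", "docA", "docB", "docB", "docC"),
  lemma         = c("discussion", "discussion", "discussion", "result", "discussion"),
  head_token_id = c("0", "4", "2", "0", "0"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: summary counts for one lemma
lemma_coverage <- function(annot, target) {
  hits  <- annot[annot$lemma == target, ]
  roots <- hits[which(hits$head_token_id == 0), ]
  c(occurrences      = nrow(hits),
    docs_with_lemma  = length(unique(hits$doc_id)),
    root_occurrences = nrow(roots),
    docs_with_root   = length(unique(roots$doc_id)))
}

coverage <- lemma_coverage(toy, "discussion")
print(coverage)
```

If docs_with_root is much smaller than docs_with_lemma, as the counts above suggest for “discussion”, the root property cannot identify the section title in every article.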

Visualize the most recurrent head tokens of the lemmas material, materials, method and methods

To explore the relationships of the lemmas “material(s)” and “method(s)” with the rest of the corpus, we can analyse which head tokens are the most recurrent for each of them. The goals of the analysis are:

  • to observe whether the lemma “material(s)” is often its own head, and how frequently
  • to observe which other lemmas are commonly the head of the lemma “material(s)”
  • to answer the same questions for the lemmas “method” and “methods”

Lemma material

grep_lemma_head_token_id <- function(index){
  #catch the lemma corresponding to the head_token_id of the token at the entry "index" of x
  #x[index,] return a token and all the associated data : lemma, but also sentence and doc_id
  occurrence<-x[index,] #x[index,], where x is the dataframe of annotation generated by udpipe
  head_token_id<-occurrence$head_token_id
  head_token_id<-as.numeric(head_token_id)
  sentence_id<-occurrence$sentence_id
  doc_id<-occurrence$doc_id
  #the following line query the lemma of the head_token_id based on the previous parameters
  lemma_head_token_id<-x[which(x$sentence_id==sentence_id & x$doc_id==doc_id)[head_token_id],]$lemma
  if (head_token_id==0) {lemma_head_token_id=occurrence$lemma}
  return(lemma_head_token_id)
}
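
Calling grep_lemma_head_token_id once per occurrence scans the whole table each time and picks the head by its position in the sentence. A possible vectorised alternative, sketched here under assumptions, looks the head up by token_id using match(). The toy table and the function name head_lemma are hypothetical; only the column names come from the udpipe annotation.

```r
# Toy annotation table (invented values)
toy <- data.frame(
  doc_id        = rep("docA", 5),
  sentence_id   = c(1, 1, 1, 2, 2),
  token_id      = c("1", "2", "3", "1", "2"),
  lemma         = c("material", "and", "method", "test", "material"),
  head_token_id = c("0", "3", "1", "2", "0"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: lemma of the head of every token, computed in one pass
head_lemma <- function(annot) {
  key      <- paste(annot$doc_id, annot$sentence_id, annot$token_id, sep = "\t")
  head_key <- paste(annot$doc_id, annot$sentence_id, annot$head_token_id, sep = "\t")
  out <- annot$lemma[match(head_key, key)]
  # Root tokens (head_token_id = 0) are treated as their own head, as in the report
  root <- which(annot$head_token_id == 0)
  out[root] <- annot$lemma[root]
  out
}

toy$head_lemma <- head_lemma(toy)
print(table(toy$head_lemma[toy$lemma == "material"]))
```

The resulting column can then be tabulated per lemma, as is done in the following chunks with table().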

material_occurrences<-which(x$lemma=="material")
head_token_lemmas<-sapply(material_occurrences, grep_lemma_head_token_id)
mytable<-table(head_token_lemmas)
stats<-as.data.frame(mytable) 

stats<-stats[order(stats$Freq, decreasing = TRUE),]
stats$key <- factor(stats$head_token_lemmas, levels = rev(stats$head_token_lemmas))
barchart(key ~ Freq, data = head(stats, 30), col = "cadetblue", main = "Most occurring lemma corresponding to the head_token_id \n for lemma material", xlab = "Freq")

Lemma materials

occurrences<-which(x$lemma=="materials") 

head_token_lemmas<-sapply(occurrences, grep_lemma_head_token_id)
mytable<-table(head_token_lemmas)
stats<-as.data.frame(mytable) 

stats<-stats[order(stats$Freq, decreasing = TRUE),]
stats$key <- factor(stats$head_token_lemmas, levels = rev(stats$head_token_lemmas))
barchart(key ~ Freq, data = head(stats, 30), col = "cadetblue", main = "Most occurring lemma corresponding to the head_token_id \n for lemma materials", xlab = "Freq")

Lemma method

occurrences<-which(x$lemma=="method") 

head_token_lemmas<-sapply(occurrences, grep_lemma_head_token_id)
mytable<-table(head_token_lemmas)
stats<-as.data.frame(mytable) 


stats<-stats[order(stats$Freq, decreasing = TRUE),]
stats$key <- factor(stats$head_token_lemmas, levels = rev(stats$head_token_lemmas))
barchart(key ~ Freq, data = head(stats, 30), col = "cadetblue", main = "Most occurring lemma corresponding to the head_token_id \n for lemma method", xlab = "Freq")

Lemma methods

occurrences<-which(x$lemma=="methods") 

head_token_lemmas<-sapply(occurrences, grep_lemma_head_token_id)
mytable<-table(head_token_lemmas)
stats<-as.data.frame(mytable) 


stats<-stats[order(stats$Freq, decreasing = TRUE),]
stats$key <- factor(stats$head_token_lemmas, levels = rev(stats$head_token_lemmas))
barchart(key ~ Freq, data = head(stats, 30), col = "cadetblue", main = "Most occurring lemma corresponding to the head_token_id \n for lemma methods", xlab = "Freq")

head(stats, 10)
##    head_token_lemmas Freq         key
## 49          material   72    material
## 55           methods   48     methods
## 7                and   47         and
## 26         described   23   described
## 45                 j   21           j
## 86           Toxicol   18     Toxicol
## 25          describe   17    describe
## 73          question   15    question
## 68       preparation   14 preparation
## 78               see   14         see

Co-occurrences

Co-occurrences for material(s) and method(s)

In the next sections we test different criteria to discriminate between the occurrences of the lemmas “materials” and “material” inside the articles. The idea is to find a criterion that identifies the beginning of the “Materials and Methods” section.

Co-occurrence analysis shows how words are used either in the same sentence or next to each other. We use this approach to get a sense of the neighbourhood of the lemmas isolated by each criterion.

There are several types of co-occurrence analysis:

  • looking at which words are located in the same document/sentence/paragraph
  • looking at which words are followed by another word
  • looking at which words are in the neighbourhood of a word, i.e. follow it within a skipgram number of words

See the documentation of the udpipe package for these three uses. We use the second approach, as it is the most relevant to our goal and the simplest to interpret. Different skipgram values can be used to get an idea of the more distant or more proximal neighbourhood.

The two functions below are meant to save space in the document. The first one plots the word network, a common technique to visualise word co-occurrences, after keeping only the co-occurrences that involve the lemma of interest. The second one returns the first 30 rows of the same filtered table.

plot_cooccurrence <- function(stats, lemma, title){
  #helper to keep this R Markdown document short:
  #keep only the pairs involving the lemma(s) of interest, then plot the first 30 pairs as a word network
  stats <- stats %>% filter(term1 %in% c(lemma) | term2 %in% c(lemma))
  wordnetwork <- head(stats, 30)
  wordnetwork <- graph_from_data_frame(wordnetwork)
  ggraph(wordnetwork, layout = "fr") +
    geom_edge_link(aes(width = cooc, edge_alpha = cooc), edge_colour = "pink") +
    geom_node_text(aes(label = name), col = "blue", size = 5) +
    theme_graph(base_family = "Helvetica") +
    theme(legend.position = "none") +
    labs(title = title)
}
head_cooc <- function(stats, lemma){
  #helper to keep this R Markdown document short:
  #return the first 30 pairs involving the lemma(s) of interest
  stats <- stats %>% filter(term1 %in% c(lemma) | term2 %in% c(lemma))
  head(stats, 30)
}
stats <- cooccurrence(x = x$lemma, skipgram = 0)

Larger skipgram values were not really relevant. With skipgram = 0, the cooc column of the data frame stats simply counts how many times term2 directly follows term1.
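
To make concrete what is being counted, the base R sketch below counts, in an invented vector of lemmas, how often each lemma is directly followed by another. It is meant as an illustration of the idea behind cooccurrence() with skipgram = 0, not as a reproduction of the udpipe implementation; the lemmas vector is made up for the example.

```r
# Invented sequence of lemmas, in reading order
lemmas <- c("material", "and", "method", ".", "material", "and", "method", "2.1")

# Each lemma paired with the one that directly follows it
pairs <- data.frame(
  term1 = head(lemmas, -1),
  term2 = tail(lemmas, -1),
  cooc  = 1,
  stringsAsFactors = FALSE
)

# Sum the occurrences of each (term1, term2) pair and sort by frequency
counts <- aggregate(cooc ~ term1 + term2, data = pairs, FUN = sum)
counts <- counts[order(-counts$cooc), ]
print(counts)
```

Note that such a count is directional: the pair (material, and) is distinct from (and, material), which matches the separate rows seen in the tables of this section.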

plot_cooccurrence(stats, lemma="materials", title="Co-occurences for the lemma materials")

head_cooc(stats, lemma="materials")
##         term1            term2 cooc
## 1   materials                &   60
## 2   materials          science   58
## 3       apply        materials   54
## 4          of        materials   42
## 5   materials         research   35
## 6  Biomedical        materials   33
## 7           .        materials   30
## 8         and        materials   19
## 9           /        materials   14
## 10  materials                /   14
## 11    methods        materials   13
## 12     method        materials    8
## 13         in        materials    8
## 14  materials        Chemistry    7
## 15  materials              and    7
## 16       this        materials    6
## 17        for        materials    6
## 18  materials characterization    5
## 19  materials                ,    5
## 20  materials          section    3
## 21  materials          Science    3
## 22  materials             inc.    3
## 23  materials       commercial    3
## 24  materials      engineering    2
## 25          ,        materials    2
## 26       bulk        materials    2
## 27  reference        materials    2
## 28  materials        catalogue    2
## 29    various        materials    2
## 30  materials       technology    2
plot_cooccurrence(stats, lemma="material", title="Co-occurences for the lemma material")

head_cooc(stats, lemma="material")
##            term1    term2 cooc
## 1       material      and  800
## 2              . material  513
## 3       material        .  453
## 4       material        ,  391
## 5       material       be  347
## 6             of material  246
## 7            the material  242
## 8           this material  203
## 9           test material  156
## 10      material       in  155
## 11      material        (  154
## 12      material       at  137
## 13 Supplementary material  109
## 14 supplementary material  105
## 15      material      for   82
## 16          bulk material   71
## 17      material     have   65
## 18           and material   61
## 19      material     that   60
## 20      nanotube material   56
## 21       foreign material   53
## 22      material     with   53
## 23     reference material   50
## 24      material        :   44
## 25             t material   41
## 26      material       to   39
## 27      material       on   39
## 28       genetic material   39
## 29            in material   37
## 30      material      the   33
plot_cooccurrence(stats, lemma="methods", title="Co-occurences for the lemma methods")

head_cooc(stats, lemma="methods")
##          term1        term2 cooc
## 1          and      methods  151
## 2      methods            .   52
## 3           in      methods   45
## 4            .      methods   39
## 5      methods            ,   35
## 6      Immunol      methods   30
## 7         Mech      methods   21
## 8            :      methods   19
## 9      methods     material   17
## 10           ,      methods   16
## 11         see      methods   16
## 12     methods            )   15
## 13     methods    materials   13
## 14     methods      section   12
## 15     methods            :   11
## 16     methods          2.1    9
## 17     methods  preparation    7
## 18     methods         test    7
## 19     methods          the    7
## 20 alternative      methods    7
## 21     methods     Chemical    6
## 22     methods       animal    6
## 23     methods          and    6
## 24     methods          For    6
## 25     methods    Downloade    6
## 26     methods            (    5
## 27         the      methods    5
## 28     methods           in    5
## 29     methods Nanoparticle    5
## 30           [      methods    5
plot_cooccurrence(stats, lemma="method", title="Co-occurences for the lemma method")

head_cooc(stats, lemma="method")
##        term1    term2 cooc
## 1        and   method  526
## 2     method        .  507
## 3     method      for  471
## 4        the   method  448
## 5          .   method  394
## 6     method       of  317
## 7     method       be  278
## 8     method        ,  254
## 9     method       to  229
## 10    method        (  214
## 11      this   method  151
## 12    method      and  137
## 13    method      2.1  134
## 14    method      use  130
## 15    method describe  126
## 16    method        :  107
## 17         )   method  101
## 18         a   method   94
## 19      test   method   80
## 20    method        [   80
## 21    method       in   79
## 22    method     have   61
## 23    method        )   51
## 24    method       as   51
## 25  analytic   method   47
## 26 sensitive   method   46
## 27    method     with   44
## 28    method      Mol   41
## 29     vitro   method   39
## 30    method   animal   38

Co-occurrences, visualization of all the lemmas of interest

plot_cooccurrence(stats, lemma=c("materials", "material", "methods", "method"), title="Co-occurences for several lemmas : materials, material, methods, method")

head_cooc(stats, lemma=c("materials", "material", "methods", "method"))
##            term1    term2 cooc
## 1       material      and  800
## 2            and   method  526
## 3              . material  513
## 4         method        .  507
## 5         method      for  471
## 6       material        .  453
## 7            the   method  448
## 8              .   method  394
## 9       material        ,  391
## 10      material       be  347
## 11        method       of  317
## 12        method       be  278
## 13        method        ,  254
## 14            of material  246
## 15           the material  242
## 16        method       to  229
## 17        method        (  214
## 18          this material  203
## 19          test material  156
## 20      material       in  155
## 21      material        (  154
## 22          this   method  151
## 23           and  methods  151
## 24        method      and  137
## 25      material       at  137
## 26        method      2.1  134
## 27        method      use  130
## 28        method describe  126
## 29 Supplementary material  109
## 30        method        :  107

Co-occurrences for materials and material when their head_token_id = 0

As in the previous approach, we want to explore the relationships of the different lemmas with their neighbourhood in the text corpus, but we restrict the analysis to sentences in which the lemma material or materials is its own head token.

Even if not every “Materials and Methods” section title has a “materials” lemma with a head_token_id equal to zero, the converse could hold: a “materials” lemma with a head_token_id equal to zero could reliably mark such a title.

Here, by restricting the analysis to the lemmas “materials” and “material” that have a head_token_id = 0, we can visualize their statistical association with other words and understand whether this subset of tokens really delimits the beginning of the “Materials and Methods” section.

The first function filters for sentences where the lemma material or materials is the head; the second returns all the tokens of such a sentence. The following lines compute the co-occurrences and draw the plot as before.
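
One point to keep in mind is that, once the selected sentences are bound together into a single vector of lemmas, pairs made of the last lemma of one sentence and the first lemma of the next may also be counted. The sketch below, on invented data, selects the sentences rooted at a given lemma in a vectorised way and then builds adjacent pairs within each sentence only. The function names sentences_rooted_at and pairs_in_sentence are hypothetical helpers, not part of udpipe.

```r
# Toy annotation table (invented values)
toy <- data.frame(
  doc_id        = c("docA", "docA", "docA", "docA", "docA", "docA", "docB", "docB"),
  sentence_id   = c(1, 1, 1, 2, 2, 2, 1, 1),
  lemma         = c("materials", "and", "methods", "use", "bulk", "materials",
                    "materials", "science"),
  head_token_id = c("0", "3", "1", "0", "3", "1", "0", "1"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: all tokens of the sentences whose root is `target`
sentences_rooted_at <- function(annot, target) {
  sent_key <- paste(annot$doc_id, annot$sentence_id, sep = "\t")
  is_hit   <- which(annot$lemma == target & annot$head_token_id == 0)
  annot[sent_key %in% sent_key[is_hit], ]
}

# Hypothetical helper: adjacent lemma pairs of a single sentence
pairs_in_sentence <- function(lemmas) {
  if (length(lemmas) < 2) return(NULL)
  data.frame(term1 = head(lemmas, -1), term2 = tail(lemmas, -1),
             stringsAsFactors = FALSE)
}

sub    <- sentences_rooted_at(toy, "materials")
groups <- split(sub$lemma, paste(sub$doc_id, sub$sentence_id, sep = "\t"))
pairs  <- do.call(rbind, lapply(groups, pairs_in_sentence))
print(pairs)
```

In this toy example the pair (methods, materials), which would only arise from joining the two selected sentences, does not appear.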

create_subset_corpus<- function(index){
  #this function is aimed to help construct a subset of x for the part of the analysis :
  #Co-occurences for materials and material when their head_token_id = 0
  #x[index,] return a token and all the associated data : lemma, but also sentence and doc_id
  occurrence<-x[index,] #x[index,], where x is the dataframe of annotation generated by udpipe
  sentence_id<-occurrence$sentence_id
  doc_id<-occurrence$doc_id
  #the following lines collect the head_token_id and test if is equal to zero
  #if so, its output the tokens of the sentences
  head_token_id<-occurrence$head_token_id
  if (head_token_id==0) {return(strip_corpus(doc_id, sentence_id))} 
  return()
}

strip_corpus <- function(doc_id, sentence_id){
  #this function returns all the lemma of a sentence, in the appropriate format
  #the purpose of doing so is to allow for calculation of cooccurence of words inside this sentences
  #for this we need all the elements of the sentence
  sentence_id<-as.numeric(sentence_id)
  subset_article<-x[which(x$sentence_id==sentence_id & x$doc_id==doc_id),]
  return(subset_article)
}
occurrences<-which(x$lemma=="materials")
subset_corpus<-lapply(occurrences, function(index) create_subset_corpus(index)) #lapply always returns a list, which do.call(rbind, ...) expects
subset_corpus<-do.call(rbind, subset_corpus)

stats <- cooccurrence(x = subset_corpus$lemma, skipgram = 0)
plot_cooccurrence(stats, lemma="materials", title="Co-occurences for the lemma materials \n when its head_token_id is equal to 0")

head_cooc(stats, lemma="materials")
##            term1     term2 cooc
## 1          apply materials   45
## 2      materials         &   45
## 3             of materials    2
## 4      materials   science    1
## 5      materials   Science    1
## 6        methods materials    1
## 7      materials         .    1
## 8              ; materials    1
## 9      materials         ,    1
## 10 Supplementary materials    1
## 11     materials available    1
plot_cooccurrence(stats, lemma=c("materials", "material", "methods", "method"), title="Co-occurences for several lemmas : materials, material, methods, method, \n when head_token_id of lemma materials is equal to 0")

head_cooc(stats, lemma=c("materials", "material", "methods", "method"))
##            term1     term2 cooc
## 1          apply materials   45
## 2      materials         &   45
## 3             of materials    2
## 4      materials   science    1
## 5      materials   Science    1
## 6      Interface   methods    1
## 7        methods materials    1
## 8      materials         .    1
## 9              ; materials    1
## 10     materials         ,    1
## 11 Supplementary materials    1
## 12     materials available    1
occurrences<-which(x$lemma=="material")
subset_corpus<-lapply(occurrences, function(index) create_subset_corpus(index)) #lapply always returns a list, which do.call(rbind, ...) expects
subset_corpus<-do.call(rbind, subset_corpus)

stats <- cooccurrence(x = subset_corpus$lemma, skipgram = 0)
plot_cooccurrence(stats, lemma="material", title="Co-occurences for the lemma material \n when its head_token_id is equal to 0")

head_cooc(stats, lemma="material")
##            term1    term2 cooc
## 1       material      and  188
## 2              . material  139
## 3       material        .   41
## 4  supplementary material   36
## 5       nanotube material   32
## 6       material material   30
## 7         method material   28
## 8              : material   26
## 9       material        :   25
## 10      material      for   16
## 11          test material   11
## 12      material       in    9
## 13      material  science    9
## 14      material        ,    8
## 15             ; material    7
## 16      material        (    5
## 17 Supplementary material    5
## 18      material       to    5
## 19      material       of    5
## 20     important material    5
## 21      material     that    4
## 22      material     with    4
## 23    Mesoporous material    4
## 24     composite material    4
## 25          from material    4
## 26             / material    4
## 27             , material    3
## 28       methods material    3
## 29      material        &    3
## 30      material        5    3
plot_cooccurrence(stats, lemma=c("materials", "material", "methods", "method"), title="Co-occurences for several lemmas : materials, material, methods, method, \n when head_token_id of lemma material is equal to 0")

head_cooc(stats, lemma=c("materials", "material", "methods", "method"))
##            term1         term2 cooc
## 1       material           and  188
## 2            and        method  153
## 3              .      material  139
## 4       material             .   41
## 5         method           2.1   40
## 6  supplementary      material   36
## 7       nanotube      material   32
## 8       material      material   30
## 9         method      material   28
## 10             :      material   26
## 11      material             :   25
## 12           and       methods   20
## 13      material           for   16
## 14        method        animal   13
## 15          test      material   11
## 16      material            in    9
## 17      material       science    9
## 18      material             ,    8
## 19             ;      material    7
## 20        method      Chemical    7
## 21        method             :    6
## 22      material             (    5
## 23 Supplementary      material    5
## 24      material            to    5
## 25      material            of    5
## 26     important      material    5
## 27      material          that    4
## 28        method   preparation    4
## 29        method Nanoparticles    4
## 30      material          with    4

Co-occurrences for methods and method when their head_token_id = 0

occurrences<-which(x$lemma=="methods")
subset_corpus<-lapply(occurrences, function(index) create_subset_corpus(index)) #lapply always returns a list, which do.call(rbind, ...) expects
subset_corpus<-do.call(rbind, subset_corpus)

stats <- cooccurrence(x = subset_corpus$lemma, skipgram = 0)
plot_cooccurrence(stats, lemma="methods", title="Co-occurences for the lemma methods \n when its head_token_id is equal to 0")

head_cooc(stats, lemma="methods")
##           term1        term2 cooc
## 1       Immunol      methods   16
## 2             .      methods   13
## 3       methods            :    7
## 4             :      methods    4
## 5       methods          for    3
## 6     Microbiol      methods    2
## 7  experimental      methods    2
## 8       methods          2.1    2
## 9       methods          159    2
## 10      Toxicol      methods    2
## 11      methods          204    1
## 12      methods          24:    1
## 13      methods            (    1
## 14      methods   1983;65:55    1
## 15        Virol      methods    1
## 16      methods       115:99    1
## 17      methods           63    1
## 18      methods experimental    1
## 19      methods      2008;73    1
## 20      methods            ,    1
## 21      methods           65    1
## 22      culture      methods    1
## 23      methods   1988;11:15    1
## 24      methods            .    1
## 25      methods          278    1
## 26      methods      2010;62    1
## 27      methods           78    1
## 28      methods          101    1
## 29      methods           62    1
## 30      methods         2011    1
plot_cooccurrence(stats, lemma=c("materials", "material", "methods", "method"), title="Co-occurences for several lemmas : materials, material, methods, method, \n when head_token_id of lemma methods is equal to 0")

head_cooc(stats, lemma=c("materials", "material", "methods", "method"))
##           term1        term2 cooc
## 1       Immunol      methods   16
## 2             .      methods   13
## 3       methods            :    7
## 4             :      methods    4
## 5       methods          for    3
## 6     Microbiol      methods    2
## 7  experimental      methods    2
## 8       methods          2.1    2
## 9       methods          159    2
## 10      Toxicol      methods    2
## 11      methods          204    1
## 12      methods          24:    1
## 13      methods            (    1
## 14      methods   1983;65:55    1
## 15        Virol      methods    1
## 16      methods       115:99    1
## 17      methods           63    1
## 18      methods experimental    1
## 19      methods      2008;73    1
## 20      methods            ,    1
## 21            t     material    1
## 22     material            ,    1
## 23      methods           65    1
## 24      culture      methods    1
## 25      methods   1988;11:15    1
## 26      methods            .    1
## 27      methods          278    1
## 28      methods      2010;62    1
## 29      methods           78    1
## 30      methods          101    1
occurrences<-which(x$lemma=="method")
subset_corpus<-lapply(occurrences, function(index) create_subset_corpus(index)) #lapply always returns a list, which do.call(rbind, ...) expects
subset_corpus<-do.call(rbind, subset_corpus)

stats <- cooccurrence(x = subset_corpus$lemma, skipgram = 0)
plot_cooccurrence(stats, lemma="method", title="Co-occurences for the lemma method \n when its head_token_id is equal to 0")

head_cooc(stats, lemma="method")
##           term1   term2 cooc
## 1             .  method  195
## 2        method     for  153
## 3        method       :   79
## 4             :  method   63
## 5        method      to   60
## 6        method     Mol   40
## 7        method       .   38
## 8        method Enzymol   29
## 9     sensitive  method   27
## 10       method      in   20
## 11       method      of   19
## 12       method  method   18
## 13       method     and   16
## 14       method       (   15
## 15          the  method   15
## 16            ;  method   14
## 17            a  method   12
## 18     standard  method   10
## 19            )  method   10
## 20          and  method    9
## 21       method    that    8
## 22         easy  method    8
## 23     analytic  method    8
## 24            &  method    8
## 25        vitro  method    8
## 26        assay  method    8
## 27            ,  method    7
## 28 nanotoxicity  method    7
## 29       revise  method    6
## 30       method       ,    6
plot_cooccurrence(stats, lemma=c("materials", "material", "methods", "method"), title="Co-occurences for several lemmas : materials, material, methods, method, \n when head_token_id of lemma method is equal to 0")

head_cooc(stats, lemma=c("materials", "material", "methods", "method"))
##           term1    term2 cooc
## 1             .   method  195
## 2        method      for  153
## 3        method        :   79
## 4             :   method   63
## 5        method       to   60
## 6        method      Mol   40
## 7        method        .   38
## 8        method  Enzymol   29
## 9     sensitive   method   27
## 10       method       in   20
## 11       method       of   19
## 12       method   method   18
## 13       method      and   16
## 14       method        (   15
## 15          the   method   15
## 16            ;   method   14
## 17            a   method   12
## 18     standard   method   10
## 19            )   method   10
## 20            . material    9
## 21     material      and    9
## 22          and   method    9
## 23       method     that    8
## 24         easy   method    8
## 25     analytic   method    8
## 26            &   method    8
## 27        vitro   method    8
## 28        assay   method    8
## 29            ,   method    7
## 30 nanotoxicity   method    7

Co-occurrences for materials and material when it is the last lemma of the document

We could assume that the last occurrence of the lemma “materials” in an article corresponds to the section title “Materials and Methods”. As before, we use co-occurrences to see how words are connected to the last occurrence of “materials” in each document, and how often it corresponds to a “Materials and Methods” section.

The function below selects the last occurrence of a lemma in each document and, with strip_corpus, retrieves the tokens of its sentence. A graph showing the connections between words for this subset of sentences is then plotted.
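
The selection of the last occurrence per document can also be written without a loop over occurrences. The following sketch uses duplicated() with fromLast = TRUE on an invented toy table; it assumes that the rows are in reading order within each document, as they are in the udpipe output. The function name last_occurrence_rows is a hypothetical helper.

```r
# Toy annotation table (invented values), rows in reading order
toy <- data.frame(
  doc_id      = c("docA", "docA", "docA", "docB", "docB"),
  sentence_id = c(1, 5, 9, 2, 3),
  lemma       = c("materials", "materials", "method", "materials", "science"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: row index of the last occurrence of `target` in each document
last_occurrence_rows <- function(annot, target) {
  hits <- which(annot$lemma == target)
  # A hit is the last one of its document if no later hit shares its doc_id
  hits[!duplicated(annot$doc_id[hits], fromLast = TRUE)]
}

last_rows <- last_occurrence_rows(toy, "materials")
print(toy[last_rows, ])
```

The doc_id and sentence_id of the returned rows are then enough to pull out the corresponding sentences.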

create_subset_corpus_last_lemmas <- function(index){
  # Helper used to build a subset of x for the part of the analysis:
  # Co-occurrences for materials and material when it is the last lemma of the document
  # x is the data frame of annotations generated by udpipe; x[index,] returns one token
  # with all its associated data: lemma, but also sentence_id and doc_id
  occurrence <- x[index,]
  sentence_id <- occurrence$sentence_id
  doc_id <- occurrence$doc_id
  lemma <- occurrence$lemma
  # all rows of this document that carry the same lemma, in row order
  occurrences_in_doc <- which(x$doc_id==doc_id & x$lemma==lemma)
  last_occurrence <- occurrences_in_doc[length(occurrences_in_doc)]
  # keep the sentence only if this token is the last occurrence of the lemma in its document
  if (last_occurrence==index){return(strip_corpus(doc_id, sentence_id))}
  return(NULL)
}
occurrences<-which(x$lemma=="materials")
subset_corpus<-lapply(occurrences, create_subset_corpus_last_lemmas) #lapply always returns a list, which is what do.call(rbind, ...) expects
subset_corpus<-do.call(rbind, subset_corpus)

stats <- cooccurrence(x = subset_corpus$lemma, skipgram = 0)
plot_cooccurrence(stats, lemma="materials", title="Co-occurrences for the lemma materials \n when it is the last lemma of the document")

head_cooc(stats, lemma="materials")
##            term1            term2 cooc
## 1      materials          science   39
## 2              .        materials   26
## 3             of        materials   26
## 4            and        materials   14
## 5              /        materials   13
## 6      materials                /   13
## 7      materials                &   12
## 8        methods        materials   11
## 9      materials         research    8
## 10        method        materials    7
## 11         apply        materials    6
## 12           for        materials    6
## 13    Biomedical        materials    5
## 14     materials characterization    5
## 15     materials                ,    5
## 16     materials        Chemistry    4
## 17     materials              and    4
## 18          this        materials    3
## 19     materials          section    3
## 20     materials             inc.    3
## 21            in        materials    3
## 22     materials       commercial    3
## 23          bulk        materials    2
## 24       various        materials    2
## 25     materials          Science    2
## 26             &        materials    2
## 27 Supplementary        materials    2
## 28             :        materials    2
## 29             ,        materials    2
## 30  \u0084\u0084        materials    2
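To help read tables such as the one above, here is a minimal sketch of how adjacent co-occurrence counts can be computed, assuming that with `skipgram = 0` the `cooc` column is the number of times `term1` is immediately followed by `term2`. It is written in base R on a toy vector and is not the udpipe implementation; `adjacent_pairs` and `toy_lemmas` are hypothetical names.

```r
# Count how often each lemma is immediately followed by another lemma.
adjacent_pairs <- function(lemmas) {
  if (length(lemmas) < 2) {
    return(data.frame(term1 = character(0), term2 = character(0),
                      cooc = integer(0), stringsAsFactors = FALSE))
  }
  # Pair every lemma with the one that follows it
  pairs <- data.frame(term1 = head(lemmas, -1),
                      term2 = tail(lemmas, -1),
                      stringsAsFactors = FALSE)
  pairs$cooc <- 1L
  # Sum the occurrences of each distinct (term1, term2) pair
  counts <- aggregate(cooc ~ term1 + term2, data = pairs, FUN = sum)
  # Most frequent pairs first
  counts[order(-counts$cooc, counts$term1, counts$term2), ]
}

# Hypothetical lemma sequence, for illustration only
toy_lemmas <- c("materials", "and", "methods", ".",
                "materials", "science", ".",
                "materials", "and", "methods")
counts <- adjacent_pairs(toy_lemmas)
print(counts)
```

Because the whole vector is treated as one sequence, the last lemma of one sentence is paired with the first lemma of the next one. This may explain why pairs such as "." followed by "materials" appear in the tables.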
plot_cooccurrence(stats, lemma=c("materials", "material", "methods", "method"), title="Co-occurrences for several lemmas : materials, material, methods, method, \n when materials is the last lemma of the document")

head_cooc(stats, lemma=c("materials", "material", "methods", "method"))
##         term1            term2 cooc
## 1   materials          science   39
## 2           .        materials   26
## 3          of        materials   26
## 4    material              and   24
## 5         and           method   18
## 6           .         material   17
## 7      method              2.1   17
## 8         and        materials   14
## 9           /        materials   13
## 10  materials                /   13
## 11  materials                &   12
## 12    methods        materials   11
## 13  materials         research    8
## 14     method        materials    7
## 15      apply        materials    6
## 16        for        materials    6
## 17        and          methods    6
## 18          .           method    5
## 19 Biomedical        materials    5
## 20  materials characterization    5
## 21          &           method    5
## 22  materials                ,    5
## 23  materials        Chemistry    4
## 24          .          methods    4
## 25  materials              and    4
## 26          :         material    4
## 27       this        materials    3
## 28  materials          section    3
## 29 functional         material    3
## 30  materials             inc.    3
occurrences<-which(x$lemma=="material")
subset_corpus<-lapply(occurrences, create_subset_corpus_last_lemmas) #lapply always returns a list, which is what do.call(rbind, ...) expects
subset_corpus<-do.call(rbind, subset_corpus)

stats <- cooccurrence(x = subset_corpus$lemma, skipgram = 0)
plot_cooccurrence(stats, lemma="material", title="Co-occurrences for the lemma material \n when it is the last lemma of the document")

head_cooc(stats, lemma="material")
##            term1            term2 cooc
## 1       material              and  157
## 2              .         material  103
## 3       material                .  102
## 4             of         material   96
## 5       material               at   83
## 6       material               be   47
## 7  Supplementary         material   46
## 8       material                ,   39
## 9            the         material   38
## 10          this         material   29
## 11      material              for   23
## 12      material        available   22
## 13      material               in   19
## 14      nanotube         material   19
## 15      material                :   17
## 16      material                (   13
## 17 supplementary         material   13
## 18        method         material   11
## 19       genetic         material   11
## 20           and         material   11
## 21             t         material   11
## 22     nanosized         material   10
## 23     reference         material    9
## 24             :         material    9
## 25            in         material    8
## 26      material characterization    8
## 27      material            refer    8
## 28       section         material    7
## 29      material               as    7
## 30             ,         material    7
plot_cooccurrence(stats, lemma=c("materials", "material", "methods", "method"), title="Co-occurrences for several lemmas : materials, material, methods, method, \n when material is the last lemma of the document")

head_cooc(stats, lemma=c("materials", "material", "methods", "method"))
##            term1     term2 cooc
## 1       material       and  157
## 2              .  material  103
## 3            and    method  102
## 4       material         .  102
## 5             of  material   96
## 6       material        at   83
## 7       material        be   47
## 8  Supplementary  material   46
## 9       material         ,   39
## 10           the  material   38
## 11          this  material   29
## 12           and   methods   26
## 13        method       2.1   25
## 14      material       for   23
## 15      material available   22
## 16      material        in   19
## 17      nanotube  material   19
## 18      material         :   17
## 19      material         (   13
## 20 supplementary  material   13
## 21        method  material   11
## 22       genetic  material   11
## 23           and  material   11
## 24             t  material   11
## 25     nanosized  material   10
## 26        method    animal    9
## 27     reference  material    9
## 28             :  material    9
## 29       methods         ,    9
## 30            in  material    8
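The tables above list the neighbours of the last occurrence, but they do not directly say how often that occurrence opens a “materials and methods” title. One possible way to quantify this is sketched below, under the assumption that a title can be recognised by a “method(s)” lemma appearing within the two tokens that follow the last “material(s)” lemma. The data frame `toy`, the function `followed_by_method` and the `window` parameter are hypothetical and only illustrate the idea on invented data.

```r
# Toy annotation table (hypothetical data, for illustration only).
# Document "a" ends with "bulk materials ."; document "b" ends with
# "materials and methods".
toy <- data.frame(
  doc_id = c(rep("a", 7), rep("b", 6)),
  lemma  = c("materials", "and", "methods", ".", "bulk", "materials", ".",
             "materials", "science", ".", "materials", "and", "methods"),
  stringsAsFactors = FALSE
)

# For each document, is the last occurrence of a target lemma followed,
# within `window` tokens of the same document, by "method" or "methods"?
followed_by_method <- function(annotation, target, window = 2) {
  rows <- which(annotation$lemma %in% target)
  last_rows <- rows[!duplicated(annotation$doc_id[rows], fromLast = TRUE)]
  result <- vapply(last_rows, function(i) {
    nxt <- seq(i + 1, length.out = window)
    nxt <- nxt[nxt <= nrow(annotation)]          # stay inside the table
    same_doc <- annotation$doc_id[nxt] == annotation$doc_id[i]
    any(annotation$lemma[nxt][same_doc] %in% c("method", "methods"))
  }, logical(1))
  setNames(result, annotation$doc_id[last_rows])
}

res <- followed_by_method(toy, c("materials", "material"))
print(res)
# Share of documents whose last occurrence looks like a section title
print(mean(res))
```

On the real annotation, the same function could be called with `x` in place of `toy`, provided the rows of `x` are in document order.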

Co-occurrences for lemma materials and material when they are the first lemma of a sentence

Materials

create_subset_corpus <- function(index, target){
  # Helper used to build a subset of x for the part of the analysis:
  # Co-occurrences for lemma materials and material when they are the first lemma of a sentence
  # x is the data frame of annotations generated by udpipe; x[index,] returns one token
  # with all its associated data: lemma, but also sentence_id and doc_id
  occurrence <- x[index,]
  sentence_id <- occurrence$sentence_id
  doc_id <- occurrence$doc_id
  # the following line queries the first lemma of the sentence in the right document
  first_lemma <- x[which(x$sentence_id==sentence_id & x$doc_id==doc_id)[1],]$lemma
  # isTRUE() guards against a missing (NA) lemma
  if (isTRUE(first_lemma==target)) {return(strip_corpus(doc_id, sentence_id))}
  return(NULL)
}
occurrences<-which(x$lemma=="materials")
subset_corpus<-lapply(occurrences, create_subset_corpus, target="materials") #lapply always returns a list

subset_corpus<-do.call(rbind, subset_corpus)

stats <- cooccurrence(x = subset_corpus$lemma, skipgram = 0)
plot_cooccurrence(stats, lemma="materials", title="Co-occurrences for lemma materials when it is the first lemma of a sentence")

head_cooc(stats, lemma="materials")
##           term1      term2 cooc
## 1     materials          &    7
## 2             .  materials    6
## 3  \u0084\u0084  materials    4
## 4     materials    science    2
## 5             :  materials    2
## 6     materials Poloxamers    2
## 7     materials      today    1
## 8     materials        C60    1
## 9     materials          ,    1
## 10            )  materials    1
plot_cooccurrence(stats, lemma=c("materials", "material", "methods", "method"), title="Co-occurrences for several lemmas : materials, material, methods, method, \n when lemma materials is the first lemma of a sentence")

head_cooc(stats, lemma=c("materials", "material", "methods", "method"))
##           term1        term2 cooc
## 1     materials            &    7
## 2             .    materials    6
## 3             &       method    5
## 4        method \u0084\u0084    4
## 5  \u0084\u0084    materials    4
## 6     materials      science    2
## 7             :    materials    2
## 8     materials   Poloxamers    2
## 9     materials        today    1
## 10       method            :    1
## 11    materials          C60    1
## 12    materials            ,    1
## 13            )    materials    1

Material

occurrences<-which(x$lemma=="material")
subset_corpus<-lapply(occurrences, create_subset_corpus, target="material") #lapply always returns a list

subset_corpus<-do.call(rbind, subset_corpus)

stats <- cooccurrence(x = subset_corpus$lemma, skipgram = 0)
plot_cooccurrence(stats, lemma="material", title="Co-occurrences for lemma material when it is the first lemma of a sentence")

head_cooc(stats, lemma="material")
##               term1    term2 cooc
## 1          material      and  454
## 2                 . material  309
## 3            method material   91
## 4           methods material   53
## 5                 : material   33
## 6                 , material   24
## 7          material        .   23
## 8          material material   22
## 9                 ; material   20
## 10         material        ,   12
## 11             test material   10
## 12         material  science   10
## 13         material       be    9
## 14         material        &    7
## 15         material       in    7
## 16         material       on    6
## 17           animal material    6
## 18         material      the    6
## 19        Amorphous material    6
## 20 characterization material    5
## 21                ) material    5
## 22         material Chitosan    5
## 23         material      for    4
## 24             Test material    4
## 25             Nano material    4
## 26         material      Ptx    4
## 27         material Pristine    4
## 28              663 material    4
## 29         material        5    4
## 30         Chemical material    3
plot_cooccurrence(stats, lemma=c("materials", "material", "methods", "method"), title="Co-occurrences for several lemmas : materials, material, methods, method, \n when lemma material is the first lemma of a sentence")

head_cooc(stats, lemma=c("materials", "material", "methods", "method"))
##        term1            term2 cooc
## 1   material              and  454
## 2          .         material  309
## 3        and           method  297
## 4        and          methods  135
## 5     method         material   91
## 6    methods         material   53
## 7     method              2.1   34
## 8          :         material   33
## 9     method           animal   27
## 10         ,         material   24
## 11  material                .   23
## 12  material         material   22
## 13         ;         material   20
## 14    method         Chemical   16
## 15    method characterization   13
## 16  material                ,   12
## 17      test         material   10
## 18  material          science   10
## 19  material               be    9
## 20    method        Synthesis    8
## 21    method                :    7
## 22  material                &    7
## 23   methods      preparation    7
## 24  material               in    7
## 25    method      preparation    6
## 26  material               on    6
## 27    animal         material    6
## 28   methods             test    6
## 29  material              the    6
## 30 Amorphous         material    6
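The first-lemma criterion used in this part can also be sketched in a vectorised form. The example below works on a toy data frame with the udpipe column names and assumes that the rows are in document order, so that the first row of each (`doc_id`, `sentence_id`) pair is the first token of the sentence. The names `toy` and `sentences_starting_with` are hypothetical and are not part of this report's code.

```r
# Toy annotation table (hypothetical data, for illustration only).
toy <- data.frame(
  doc_id      = c("a", "a", "a", "a", "a", "b", "b", "b"),
  sentence_id = c(1, 1, 1, 2, 2, 1, 1, 1),
  lemma       = c("material", "and", "method", "the", "material",
                  "material", "science", "material"),
  stringsAsFactors = FALSE
)

# Document and sentence ids of the sentences whose first lemma is `target`.
sentences_starting_with <- function(annotation, target) {
  # TRUE on the first row of each (doc_id, sentence_id) pair
  first_token <- !duplicated(annotation[, c("doc_id", "sentence_id")])
  starts <- first_token & !is.na(annotation$lemma) & annotation$lemma == target
  annotation[starts, c("doc_id", "sentence_id")]
}

res <- sentences_starting_with(toy, "material")
print(res)
```

Each matching sentence is returned once here, even when the target lemma appears several times in it (as in document "b" of the toy data). A loop over every occurrence of the lemma would instead select such a sentence once per occurrence.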

Conclusion